[](https://classroom.github.com/a/YXOeu4qD)
1. Create a python file that webscrapes GDP by country and plots a stacked interactive bar plot using plotly. Stack countries within regions using the IMF numbers. Please include this in your ipython notebook and output your plot to an html file containing the plot.
2. Look at the chapter on interactive graphics and, specifically, the code to display a subject's MRICloud data as a sunburst plot. Do the following. Display this subject's data as a Sankey diagram. Display as many levels as you can (at least 3) for Type = 1, starting from the intracranial volume.
3. Create a simple webpage containing the Sankey graphic and host it on GitHub Pages. Do not host this from your GitHub Classroom assignment repo, since that repo is not public. Instead, create a new public repo from your regular GitHub account and add this file. Put the link to your live web page in a markdown cell of your hw5.ipynb file as a text block.
Your homework should include
1. A file called hw5.ipynb that has your code for parts 1 and 2.
2. Two html files, one called sankey.html and one called stacked_bar.html that contain the two plots as html files.
3. Your hw5.ipynb file should have a text block that contains a link to the live and publicly hosted Sankey diagram.
Remember that you should have two repositories for this assignment. First, you need the HW repository that you create when you accept the assignment. This should contain the three files (hw5.ipynb, sankey.html, stacked_bar.html). Secondly, you will need to create your own repository containing a live link to your sankey html file. So, when I click on that link it should show a page containing your plot.
Note: Plotly figure objects have a to_html() method, which is useful for creating an HTML file.
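The web page asked for in item 3 can be sketched as a small helper that wraps a plot fragment in a complete HTML page. This is only a sketch under stated assumptions: `build_page` and the placeholder `fragment` are hypothetical names, and in the notebook the fragment would come from something like `fig.to_html(full_html=False, include_plotlyjs="cdn")`.

```python
from html import escape


def build_page(title: str, fragment: str) -> str:
    """Wrap an HTML fragment (for example a Plotly <div>) in a complete page."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>\n<h1>{escape(title)}</h1>\n{fragment}\n</body>\n"
        "</html>\n"
    )


# Placeholder standing in for fig.to_html(full_html=False, include_plotlyjs="cdn")
fragment = '<div id="sankey-plot"></div>'
page = build_page("Subject 127 Sankey diagram", fragment)
print(page)
```

Writing `page` to a file named index.html in the public repo is one common way to give GitHub Pages an entry page to serve.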
# Part 1
[48]:
import requests as rq
import bs4
import pandas as pd
[50]:
# read webpage into data
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
gdp = rq.get(url)
## print out the first 99 characters to see what it looks like
gdp.text[0 : 99]
[50]:
'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-l'
[51]:
# read page into bs4
bs4page = bs4.BeautifulSoup(gdp.text, 'html.parser')
tables = bs4page.find_all('table',{'class':"wikitable"})
[52]:
from io import StringIO
# Read the first wikitable into pandas via a StringIO object
# Note: recent versions of pandas deprecate passing a literal HTML string to read_html, so wrap it in StringIO
gdp = pd.read_html(StringIO(str(tables[0])))[0]
# drop missing data
gdp = gdp.dropna()
# rename columns
gdp = gdp.rename(columns={'Country/Territory': 'Country','UN region' : 'Region','IMF[1][13]': 'IMF', 'World Bank[14]' : 'WorldBank', 'United Nations[15]':'UnitedNations'})
# Remove "world" data(the first row; index 0)
gdp = gdp.iloc[1:]
# Replace the header with flat, descriptive column names
gdp.columns = ['Country', 'Region', 'IMF_Forecast', 'IMF_Year','WoldBank_Estimate','WorldBank_Year','UN_Estimate','UN_Year']
# display table
gdp.head()
[52]:
| | Country | Region | IMF_Forecast | IMF_Year | WoldBank_Estimate | WorldBank_Year | UN_Estimate | UN_Year |
|---|---|---|---|---|---|---|---|---|
| 1 | United States | Americas | 26949643 | 2023 | 25462700 | 2022 | 23315081 | 2021 |
| 2 | China | Asia | 17700899 | [n 1]2023 | 17963171 | [n 3]2022 | 17734131 | [n 1]2021 |
| 3 | Germany | Europe | 4429838 | 2023 | 4072192 | 2022 | 4259935 | 2021 |
| 4 | Japan | Asia | 4230862 | 2023 | 4231141 | 2022 | 4940878 | 2021 |
| 5 | India | Asia | 3732224 | 2023 | 3385090 | 2022 | 3201471 | 2021 |
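The year columns in the table above still carry footnote markers such as `[n 1]`. A minimal sketch of one way to strip them, assuming the markers are always wrapped in square brackets (the variable names `years` and `clean_years` are illustrative, not part of the notebook):

```python
import pandas as pd

# Year strings as they appear in the scraped table above
years = pd.Series(["2023", "[n 1]2023", "[n 3]2022"])

# Remove any bracketed footnote marker, then convert to integers
clean_years = (
    years.str.replace(r"\[[^\]]*\]", "", regex=True)
         .str.strip()
         .astype(int)
)
print(clean_years.tolist())
```

The same pattern could be applied to `IMF_Year`, `WorldBank_Year`, and `UN_Year` if those columns were needed downstream.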
[53]:
# Extract only the relevant columns
imf = gdp[['Country', 'Region', 'IMF_Forecast']]
[54]:
# Rename IMF column
imf = imf.rename(columns={'IMF_Forecast' : 'Forecast'})
print(imf.head())
         Country    Region  Forecast
1  United States  Americas  26949643
2          China      Asia  17700899
3        Germany    Europe   4429838
4          Japan      Asia   4230862
5          India      Asia   3732224
[55]:
# Check Forecast data type
print(imf['Forecast'].dtypes)
object
[56]:
import numpy as np
# Replace non-numeric values with NaN
imf['Forecast'] = imf['Forecast'].replace('—', np.nan)
# Convert 'Forecast' column to numeric data type
imf['Forecast'] = pd.to_numeric(imf['Forecast'], errors='coerce')
# Filter rows where 'Forecast' column is non-zero
imf = imf.loc[imf['Forecast'] != 0]
# Reset index if needed
imf.reset_index(drop=True, inplace=True)
# Show data type of 'Forecast' column
print(imf['Forecast'].dtypes)
float64
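The conversion above leaves missing values in place, because a `!= 0` filter does not remove `NaN`. The small sketch below illustrates this on three sample strings (two values from the table plus the em dash placeholder); `dropna()` is shown as one possible alternative, not as what the notebook does.

```python
import pandas as pd

# Two forecasts from the table above plus the em dash used for missing data
forecast = pd.Series(["26949643", "—", "63"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(forecast, errors="coerce")

# NaN != 0 evaluates to True, so the missing row survives this filter
kept_by_nonzero_filter = numeric[numeric != 0]

# dropna() removes the missing row explicitly
kept_by_dropna = numeric.dropna()

print(len(kept_by_nonzero_filter), len(kept_by_dropna))
```

This is consistent with the `NaN` forecast that still appears for Montserrat in the printed table further down.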
[57]:
# Convert IMF to long (unstack converts to tuple)
page = 'IMF Forecasted GDP (USD in Millions)'
y_imf = gdp[gdp['IMF_Forecast'] == page].drop(['IMF_Year','WoldBank_Estimate','WorldBank_Year','UN_Estimate','UN_Year'], axis=1).unstack()
# Check
print(imf[0:])
y = np.asarray([10, 20, 30, 40, 50])
y = y[1 : y.size] - y[0 : (y.size - 1)]
print(y)
           Country    Region    Forecast
0    United States  Americas  26949643.0
1            China      Asia  17700899.0
2          Germany    Europe   4429838.0
3            Japan      Asia   4230862.0
4            India      Asia   3732224.0
..             ...       ...         ...
207          Palau   Oceania       267.0
208       Kiribati   Oceania       246.0
209          Nauru   Oceania       150.0
210     Montserrat  Americas         NaN
211         Tuvalu   Oceania        63.0

[212 rows x 3 columns]
[10 10 10 10]
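The slicing expression used for `y` in the cell above computes successive differences. As a small sketch, the same result can be obtained with `np.diff` (the names `manual` and `builtin` are illustrative):

```python
import numpy as np

y = np.asarray([10, 20, 30, 40, 50])

# Difference between each element and the one before it, written by hand
manual = y[1:] - y[:-1]

# np.diff computes the same first-order differences
builtin = np.diff(y)

print(manual, builtin)
```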
[58]:
import plotly.express as px
# Plot stacked bar of countries within regions using the IMF numbers
fig = px.bar(imf, x = "Region", y = "Forecast", color = "Country", title='IMF Forecasted GDP by Country',
labels={'Forecast': 'GDP (USD Million)', 'Region': 'UN Region', 'Country': 'Country'})
fig.show()
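A quick way to sanity-check the stacked bars is to compute the regional totals directly, since each bar's height should equal the sum of its country segments. The sketch below uses only the five rows printed earlier, so the totals are for that sample and not for the full table; `imf_sample` is a hypothetical name.

```python
import pandas as pd

# Five rows copied from the printed IMF table above
imf_sample = pd.DataFrame({
    "Country": ["United States", "China", "Germany", "Japan", "India"],
    "Region": ["Americas", "Asia", "Europe", "Asia", "Asia"],
    "Forecast": [26949643.0, 17700899.0, 4429838.0, 4230862.0, 3732224.0],
})

# Height of each stacked bar = sum of the country segments in that region
region_totals = imf_sample.groupby("Region")["Forecast"].sum()

# Sorting by region, then by descending forecast, groups the largest economies first
ordered = imf_sample.sort_values(["Region", "Forecast"], ascending=[True, False])

print(region_totals)
print(ordered["Country"].tolist())
```

Passing a frame sorted like `ordered` to `px.bar` may also give a more readable stacking order, on the assumption that traces are created in order of first appearance.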
[59]:
# Save the plot as a standalone HTML file
html_file = 'stacked_bars.html'
fig.write_html(html_file)
# Equivalent route via to_html(): full_html=True keeps the file a complete page
plot_html = fig.to_html(full_html=True)
# Write the HTML string to the same file
with open(html_file, 'w', encoding='utf-8') as f:
    f.write(plot_html)
## Might delete
[22]:
# Select GDP source, drop everyrthing, convert to long (unstack converts to tuple)
y_imf = imf[imf['Forecast'] == page], (axis == 1).unstack()
# Convert from tuple to array
y_imf = np.asarray(y_imf)
# Get the first non zero entry
y_imf = y_imf[np.min(np.where(y_imf != 0)) : y_imf.size]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[22], line 2
      1 # Select GDP source, drop everyrthing, convert to long (unstack converts to tuple)
----> 2 y_imf = imf[imf['Forecast'] == page], (axis == 1).unstack()
      4 # Convert from tuple to array
      5 y_imf = np.asarray(y_imf)

NameError: name 'axis' is not defined
[24]:
import plotly.express as px
from jinja2 import Template
#gdp = px.data.gapminder().query("Region == 'Americas','Asia','Europe','Oceania','Africa'")
# Keep a reference to the figure so it can be rendered and shown below
fig = px.bar(gdp, x = "Region", y = "IMF_Forecast", color = "Country", title='IMF Forecasted GDP by Country',
             labels={'IMF_Forecast': 'GDP (USD Million)', 'Region': 'UN Region', 'Country': 'Country'})
stacked_bars=r"stacked_bars.html"
input_template_path = r"stacked_bars.html"
plotly_jinja_data = {"fig":fig.to_html(full_html=False)}
#consider also defining the include_plotlyjs parameter to point to an external Plotly.js as described above
# Read the template before opening the output: both paths name the same file,
# and opening it for writing first would truncate the template
with open(input_template_path, encoding="utf-8") as template_file:
    j2_template = Template(template_file.read())
with open(stacked_bars, "w", encoding="utf-8") as output_file:
    output_file.write(j2_template.render(plotly_jinja_data))
fig.show()
# Part 2
[29]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import urllib, json
import numpy as np
[30]:
# use raw file for csv
dat = pd.read_csv("https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21.csv").drop(['Unnamed: 0'], axis = 1)
dat.head()
[30]:
| | id | roi | volume |
|---|---|---|---|
| 0 | 127 | Telencephalon_L | 531111 |
| 1 | 127 | Telencephalon_R | 543404 |
| 2 | 127 | Diencephalon_L | 9683 |
| 3 | 127 | Diencephalon_R | 9678 |
| 4 | 127 | Mesencephalon | 10268 |
[41]:
## load in the hierarchy information
url = "https://raw.githubusercontent.com/bcaffo/MRIcloudT1volumetrics/master/inst/extdata/multilevel_lookup_table.txt"
multilevel_lookup = pd.read_csv(url, sep = "\t").drop(['Level5'], axis = 1)
multilevel_lookup = multilevel_lookup.rename(columns = {
"modify" : "roi",
"modify.1" : "level4",
"modify.2" : "level3",
"modify.3" : "level2",
"modify.4" : "level1"})
multilevel_lookup = multilevel_lookup[['roi', 'level4', 'level3', 'level2', 'level1']]
multilevel_lookup.head()
[41]:
| | roi | level4 | level3 | level2 | level1 |
|---|---|---|---|---|---|
| 0 | SFG_L | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L |
| 1 | SFG_R | SFG_R | Frontal_R | CerebralCortex_R | Telencephalon_R |
| 2 | SFG_PFC_L | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L |
| 3 | SFG_PFC_R | SFG_R | Frontal_R | CerebralCortex_R | Telencephalon_R |
| 4 | SFG_pole_L | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L |
[42]:
## Load in the subject data
id = 127
subjectData = pd.read_csv("https://raw.githubusercontent.com/smart-stats/ds4bio_book/main/book/assetts/kirby21AllLevels.csv")
subjectData = subjectData.loc[(subjectData.type == 1) & (subjectData.level == 5) & (subjectData.id == id)]
subjectData = subjectData[['roi', 'volume']]
## Merge the subject data with the multilevel data
subjectData = pd.merge(subjectData, multilevel_lookup, on = "roi")
subjectData = subjectData.assign(icv = "ICV")
subjectData = subjectData.assign(comp = subjectData.volume / np.sum(subjectData.volume))
subjectData.head()
[42]:
| | roi | volume | level4 | level3 | level2 | level1 | icv | comp |
|---|---|---|---|---|---|---|---|---|
| 0 | SFG_L | 12926 | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L | ICV | 0.009350 |
| 1 | SFG_R | 10050 | SFG_R | Frontal_R | CerebralCortex_R | Telencephalon_R | ICV | 0.007270 |
| 2 | SFG_PFC_L | 12783 | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L | ICV | 0.009247 |
| 3 | SFG_PFC_R | 11507 | SFG_R | Frontal_R | CerebralCortex_R | Telencephalon_R | ICV | 0.008324 |
| 4 | SFG_pole_L | 3078 | SFG_L | Frontal_L | CerebralCortex_L | Telencephalon_L | ICV | 0.002227 |
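The merge on `roi` silently drops any region that has no entry in the lookup table. A hedged sketch of how unmatched regions could be detected with the `indicator` and `validate` options of `pd.merge` follows. The volumes come from the table above, while the shortened lookup table (with `SFG_PFC_L` deliberately left out) is a made-up illustration.

```python
import pandas as pd

# Volumes copied from the subject table above
volumes = pd.DataFrame({
    "roi": ["SFG_L", "SFG_R", "SFG_PFC_L"],
    "volume": [12926, 10050, 12783],
})

# Hypothetical, deliberately incomplete lookup table: SFG_PFC_L is omitted
lookup = pd.DataFrame({
    "roi": ["SFG_L", "SFG_R"],
    "level1": ["Telencephalon_L", "Telencephalon_R"],
})

# A left merge keeps every ROI; indicator=True adds a _merge column that
# records where each row was found; validate checks lookup keys are unique
checked = pd.merge(volumes, lookup, on="roi", how="left",
                   indicator=True, validate="many_to_one")

unmatched = checked.loc[checked["_merge"] == "left_only", "roi"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that every level-5 region has a place in the hierarchy.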
[43]:
fig = px.sunburst(subjectData, path=['icv', 'level1', 'level2', 'level3', 'level4', 'roi'],
values='comp', width=800, height=800)
fig.show()
[44]:
# Create Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=['ICV', 'Telencephalon_L', 'Telencephalon_R', 'CerebralCortex_L', 'CerebralCortex_R', 'Frontal_L', 'Frontal_R', 'SFG_L', 'SFG_R', 'SFG_PFC_L', 'SFG_PFC_R', 'SFG_pole_L'],
    ),
    link=dict(
        # indices correspond to labels; each link runs from a parent region to its child
        source=[0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
        target=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        # note: these are per-ROI fractions, not totals aggregated per link
        value=subjectData['comp'].tolist()
    ))])
[45]:
# Update layout
fig.update_layout(title_text="Subject 127 Data as Sankey Diagram", font_size=10)
# Show plot
fig.show()
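The hand-written label and index lists in the Sankey cell could instead be derived from the hierarchy columns, with each link's value summed over the regions that pass through it. The sketch below does this for the five rows shown in the merged table. `sankey_links` is a hypothetical helper, and nodes are keyed by (level, name) on the assumption that a name may repeat across levels.

```python
import pandas as pd

# Five rows copied from the merged subject table above
rows = pd.DataFrame({
    "icv": ["ICV"] * 5,
    "level1": ["Telencephalon_L", "Telencephalon_R", "Telencephalon_L",
               "Telencephalon_R", "Telencephalon_L"],
    "level2": ["CerebralCortex_L", "CerebralCortex_R", "CerebralCortex_L",
               "CerebralCortex_R", "CerebralCortex_L"],
    "level3": ["Frontal_L", "Frontal_R", "Frontal_L", "Frontal_R", "Frontal_L"],
    "level4": ["SFG_L", "SFG_R", "SFG_L", "SFG_R", "SFG_L"],
    "comp": [0.009350, 0.007270, 0.009247, 0.008324, 0.002227],
})


def sankey_links(df, levels, value_col):
    """Return (labels, source, target, value) for adjacent hierarchy levels."""
    node_index = {}  # (level, name) -> integer node id
    labels, source, target, value = [], [], [], []

    def node_id(level, name):
        key = (level, name)
        if key not in node_index:
            node_index[key] = len(labels)
            labels.append(name)
        return node_index[key]

    for parent, child in zip(levels[:-1], levels[1:]):
        # Total flow between each parent/child pair at this depth
        flows = df.groupby([parent, child], sort=False)[value_col].sum()
        for (p_name, c_name), total in flows.items():
            source.append(node_id(parent, p_name))
            target.append(node_id(child, c_name))
            value.append(float(total))
    return labels, source, target, value


labels, source, target, value = sankey_links(
    rows, ["icv", "level1", "level2", "level3", "level4"], "comp")
print(labels)
print(source, target)
```

The four lists map onto the `label`, `source`, `target`, and `value` arguments used in the `go.Sankey` call, so the same helper could be run on the full `subjectData` frame.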
[60]:
# Save the plot as a standalone HTML file
html_file = 'sankey.html'
fig.write_html(html_file)
# Equivalent route via to_html(): full_html=True keeps the file a complete page
plot_html = fig.to_html(full_html=True)
# Write the HTML string to the same file
with open(html_file, 'w', encoding='utf-8') as f:
    f.write(plot_html)